import besca as bc
import scanpy as sc
import pkg_resources
import os
The datasets that are already annotated and should be used for training. If you only use one dataset please use list of one.
adata_trains = [bc.datasets.Martin2019_processed()]
The dataset of interest that should be annotated.
adata_pred = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/Smillie2019_processed.h5ad'))
adata_orig = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/Smillie2019_processed.h5ad'))
Give your analysis a name.
analysis_name = 'auto_annot_Smillie2019_with_Martin2019_dblabel'
Specify column name of celltype annotation you want to train on.
celltype ='dblabel'
Choose a method:
method = 'logistic_regression'
Specify merge method if using multiple training datasets. Needs to be either scanorama or naive.
merge = 'scanorama'
Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.
use_raw = False
You can choose to only consider a subset of genes from a signature set.
genes_to_use = 'all'
This function merges training datasets, removes unwanted genes, and if scanorama is used corrects for datasets.
adata_train, adata_pred = bc.tl.auto_annot.merge_data(adata_trains, adata_pred, genes_to_use = genes_to_use, merge = merge)
merging with scanorama using scanorama rn Found 1054 genes among all datasets [[0. 0.55163821] [0. 0. ]] Processing datasets (0, 1) integrating training set calculating intersection
The returned scaler is fitted on the training dataset (to zero mean and scaled to unit variance).
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype)
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers. [Parallel(n_jobs=10)]: Done 5 out of 5 | elapsed: 9.2min finished
Use fitted model to predict celltypes in adata_pred. Prediction will be added in a new column called 'auto_annot'. Paths are needed as adata_pred will revert to its original state (all genes, no additional corrections). The threshold should be set to 0 or left out for SVM. For logisitic regression the threshold can be set.
adata_predicted = bc.tl.auto_annot.adata_predict(classifier = classifier, scaler = scaler, adata_pred = adata_pred, adata_orig = adata_orig, threshold = 0)
Write out metrics to a report file, create confusion matrices and comparative umap plots
%matplotlib inline
bc.tl.report(
adata_pred=adata_predicted,
celltype=celltype,
method=method,
analysis_name=analysis_name,
train_datasets = adata_trains,
test_dataset = adata_orig,
merge = merge,
name_prediction='auto_annot',
name_report='auto_annot',
use_raw=use_raw,
remove_nonshared=True,
clustering='leiden',
asymmetric_matrix=True,
delimiter='\t',
verbose=True
)
acc: 0.1
f1: 0.08
ami: 0.6
ari: 1.99
silhouette dblabel: 0.14
silhouette auto_annot: 0.07
pair confusion matrix:
0 1
0 17858807838 1309041414
1 1474199868 1777472972
... storing 'auto_annot' as categorical
WARNING: saving figure to file figures/umap.ondata_auto_annot_Smillie2019_with_Martin2019_dblabel.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_dblabel.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_auto_annot.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_leiden.png
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'])
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'], legend_loc='on data', legend_fontsize=6)
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'], legend_fontsize=7, wspace = 1.4, save = '.svg')
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot'], legend_loc='on data', legend_fontsize=7, wspace = 1.4, save = '.ondata.svg')
WARNING: saving figure to file figures/umap.svg
WARNING: saving figure to file figures/umap.ondata.svg
adata_train
View of AnnData object with n_obs × n_vars = 62202 × 1054
obs: 'CELL', 'CONDITION', 'Sample_geo_accession', 'Sample_title', 'Subject', 'tissue', 'status', '10x chemistry', 'Sample_relation', 'Sample_relation_2', 'Sample_supplementary_file_1', 'Sample_supplementary_file_2', 'Sample_supplementary_file_3', 'ID_REF', 'Barcode', 'Type', 'Cluster', 'Lane', 'Subtype', 'percent_mito', 'n_counts', 'n_genes', 'batch', 'leiden', 'clusterID', 'cell_names', 'cell_group', 'scell_group', 'sscell_group', 'dblabel', 'celltype', 'cluster_celltype'
var: 'SYMBOL', 'ENSEMBL-0', 'n_cells-0', 'total_counts-0', 'frac_reads-0', 'ENSEMBL-1', 'n_cells-1', 'total_counts-1', 'frac_reads-1'
uns: 'Subtype_colors', 'Type_colors', 'cell_group_colors', 'cell_names_colors', 'leiden', 'leiden_colors', 'scell_group_colors', 'sscell_group_colors', 'umap'
obsm: 'X_pca', 'X_umap', 'X_scanorama'
# Compare to random assignment
import random
random.seed(1)
adata_predicted.obs['random_labeling'] = list(adata_predicted.obs[celltype].sample(frac=1))
bc.tl.report(
adata_pred=adata_predicted,
celltype=celltype,
method="compare_to_random_" + method,
analysis_name=analysis_name,
train_datasets = adata_trains,
test_dataset = adata_orig,
merge = merge,
name_prediction="random_labeling",
name_report="compare_to_random_auto_annot",
use_raw=use_raw,
remove_nonshared=False,
clustering='leiden',
asymmetric_matrix=True,
delimiter='\t',
verbose=True)
acc: 0.15
f1: 0.15
ami: 0.0
ari: -0.01
silhouette dblabel: 0.14
silhouette random_labeling: -0.07
pair confusion matrix:
0 1
0 16388624276 2779224976
1 2779224976 472447864
... storing 'random_labeling' as categorical
WARNING: saving figure to file figures/umap.ondata_auto_annot_Smillie2019_with_Martin2019_dblabel.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_dblabel.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_random_labeling.png
WARNING: saving figure to file figures/umap.auto_annot_Smillie2019_with_Martin2019_dblabel_leiden.png
from sinfo import sinfo
sinfo()
----- anndata 0.7.5 besca 2.4+57.g5ad53b2 pkg_resources NA plotly 4.14.3 scanpy 1.6.1 sinfo 0.3.1 sklearn 0.24.1 ----- IPython 7.20.0 jupyter_client 6.1.11 jupyter_core 4.7.1 notebook 6.2.0 ----- Python 3.7.9 | packaged by conda-forge | (default, Dec 9 2020, 21:08:20) [GCC 9.3.0] Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core 24 logical CPU cores, x86_64 ----- Session information updated at 2021-07-18 09:28
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')
nb_name = os.path.join(os.getcwd(), nb_name)
! jupyter nbconvert --to html {nb_name}
[NbConvertApp] Converting notebook /pstore/data/bioinfo/users/hatjek/devel/besca_publication_results/intestine/auto_annot/auto_annot_Smillie2019_with_Martin2019_dblabel.ipynb to html [NbConvertApp] Writing 13729484 bytes to /pstore/data/bioinfo/users/hatjek/devel/besca_publication_results/intestine/auto_annot/auto_annot_Smillie2019_with_Martin2019_dblabel.html